I. Introduction

The Long Short-Term Memory (LSTM) network is an extremely powerful model for predicting time-series data [0]. Compared with standard moving-average models, it can predict an arbitrary number of steps into the future, which most moving-average models cannot do. The LSTM is a development of the recurrent neural network (RNN): an RNN uses the previous state of its hidden neurons together with the new input to learn the current state, while an LSTM adds an extra cell that helps the network memorize long-term context. In stock prediction problems, this cell stores both short-term and long-term price information, which is essential since a stock's previous prices are crucial for predicting its future price.

In this work, we present the application of LSTM to S&P 500 stock market prediction. The main contributions of this work are:

  1. We created a highly flexible stacked LSTM pipeline that captures dynamic stock data every minute from an API and Wikipedia and makes stock price predictions.

  2. We tuned the 2-stacked LSTM model, selecting optimal hyper-parameters among 20+ candidate models with TensorFlow.

  3. We scraped online news data to help make predictions. These news inputs are also dynamic and update automatically.

  4. Our algorithm can track all S&P 500 companies at the same time.

The remainder of this paper is organized as follows. First, we introduce the methodology and training steps behind our optimal 2-stacked LSTM model. Then, we present experimental results from multiple RNN-LSTM candidate models and compare performance with and without sentiment news data. Finally, we present an error analysis based on these results and a discussion of our work.

II. Data

Quantitative Dataset:

The time-series daily stock price data comes from a Python API package [1] that can track stock prices down to one-minute intervals. The input dataset contains 4,976 days of price information for the S&P 500 stock market and currency exchange rates: the daily closing index of the S&P 500, the daily closing prices of the 505 companies in the S&P 500, the daily closing price of Bitcoin, and the daily closing exchange rate of the Chinese Yuan to the US Dollar. The timeline runs from 2000-01-03 to 2019-10-10. We store this data in a JSON file. Using the proposed model, we predict the S&P 500 index for a specific day. One further step would be to try adjusted S&P 500 stock prices that strip out the effect of dividends and acquisitions.

S&P 500 overall

Sentiment Dataset

The sentiment news data comes from another Python API [2] that tracks political and financial news at 15-minute intervals; we use the summarization of the raw text. The data contains 3 million news articles from 2015 to 2020. The sentiment dataset is mainly political news, with coverage from the Associated Press, BBC, Washington Post, New York Times, Google News, etc.; the API filters out non-political news. Each article is then classified by the actors it mentions (such as Police Forces, Government, or Military) and the actions it describes (such as appeal, visit, or consult). After that, a sentiment score is calculated as a positive or negative tonal assessment of the article based on term frequency: each word already has a predefined sentiment score, and we take the average over the article.
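The scoring step above can be sketched as follows. The lexicon here is a hypothetical placeholder, since the API's predefined word scores are not reproduced in this paper:

```python
# Average of predefined per-word sentiment scores over an article's words,
# as described above. WORD_SCORES is a hypothetical stand-in for the
# API's actual lexicon; unknown words score 0 (neutral).
WORD_SCORES = {"growth": 1.0, "agreement": 0.5, "conflict": -1.0, "sanction": -0.8}

def article_sentiment(text: str) -> float:
    """Return the average sentiment score over all words of an article."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(WORD_SCORES.get(w, 0.0) for w in words) / len(words)
```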

Example of processed sentiment data

Data Processing

We scale the data using a natural log transformation. For this sequence data, we chose 120 days as our prediction window: prices in a 120-day window are used to predict the S&P 500 index on the 121st day. In addition, we fill missing values in the numerical dataset with the next day's stock price, recursively (a backward fill).
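The transformation above can be sketched as follows (function and variable names are ours):

```python
import numpy as np

def make_windows(prices, window=120):
    """Log-transform a price series and build sliding windows:
    each sample pairs 120 consecutive log-prices with the 121st value
    as the prediction target, as described above."""
    log_p = np.log(np.asarray(prices, dtype=float))
    X = np.stack([log_p[i:i + window] for i in range(len(log_p) - window)])
    y = log_p[window:]
    return X, y
```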

For the sentiment dataset, we tried using US-China relation news to help prediction; the choice of US-China news is arbitrary. We filtered out unrelated news, keeping only articles whose actors are in ["US", "CHINA"]. We then pad each day's news inputs to a fixed number of five; the choice of five is based on the 3,659 US-China news articles that remain.
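The padding step can be sketched as follows; the neutral pad value of 0.0 is our assumption, as the paper does not state what value is used:

```python
def pad_daily_news(scores, target=5, pad_value=0.0):
    """Pad (or truncate) one day's list of news sentiment scores to a
    fixed length of five, as described above. The 0.0 pad value is an
    assumed neutral placeholder."""
    scores = list(scores)[:target]
    return scores + [pad_value] * (target - len(scores))
```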

Saturdays and Sundays are removed from our sequence, and we treat Monday as the day after Friday, since stocks are traded only on workdays. This setting is flexible and can easily be changed if it causes any bias. Both the sequence dataset and the sentiment dataset are then transformed into tensor inputs. We developed a pipeline that automatically updates our dataset. After the fill step above, our dataset does not contain any missing values, so we skip further data cleaning.

Quantitative Data processing

III. Methodology and Architecture

3.1 Training Step

Before moving to the sentiment dataset, we first set up a baseline model that includes only the numeric dataset; we incorporate the text dataset later, as the sentiment prediction model is built upon this baseline model. The training time for each baseline model, on an RTX 2080 Super, is within 5 minutes, which we consider fast and affordable.

After the dataset is transformed into tensors, it is divided into training and testing sets for evaluation. The split rate is 0.8: 80% of the data is used to train the LSTM model and 20% to validate it. In this work, we mainly focus on 5 hyper-parameters of the LSTM model: epochs, batch size, dropout rate, LSTM units, and number of stacks.
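Since the data is sequential, we assume the 80/20 split is chronological rather than shuffled, which is standard for time series; a minimal sketch:

```python
def split_sequential(X, y, rate=0.8):
    """Chronological train/validation split: the earliest 80% of windows
    train the model, the most recent 20% validate it."""
    cut = int(len(X) * rate)
    return (X[:cut], y[:cut]), (X[cut:], y[cut:])
```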

3.2 Candidate Models

We have 27 LSTM candidate models with different hyper-parameter values. We tested combinations of the batch size, LSTM units, dropout rate, and LSTM stack depth:

Epochs, the number of complete passes over the training data: from 1 to 150

Batch size, the number of samples fed into the algorithm per weight update: [64, 128, 256]

LSTM units, the number of nodes in the LSTM layer: [80, 120, 160]

Dropout rate, the percentage of nodes randomly set to 0 weight during training: [0, 0.25, 0.5]

Number of stacks, the number of LSTM layers stacked together: [1, 2, 3]
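One way the candidate grid could be enumerated, assuming (as in the experiments below) that batch size, units, and dropout form the 3 × 3 × 3 = 27 core combinations, with epochs and stack depth varied separately:

```python
from itertools import product

# The three core hyper-parameter axes listed above.
BATCH_SIZES = [64, 128, 256]
LSTM_UNITS = [80, 120, 160]
DROPOUT_RATES = [0.0, 0.25, 0.5]

# Cartesian product: 27 candidate configurations.
candidates = list(product(BATCH_SIZES, LSTM_UNITS, DROPOUT_RATES))
```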

3.3 Error Analysis for Training and Testing Sets:

For each hyper-parameter setting, we recorded the model performance. In our work, we used the mean squared error (MSE) to evaluate and compare models: we recorded the MSE after each epoch and plotted it against the number of epochs under each hyper-parameter setting. Based on this error analysis, comparing training versus validation loss after each epoch, we judged the performance of our models under the different settings. We then chose the model with the best-performing hyper-parameters as our baseline model.

Experiment 1: Examine Batch Size Effect

First, we examine the best batch size for the training process. Many papers discuss how to choose an optimal batch size; an empirical rule is to start with 32, then double it or cut it in half. We fixed the other hyper-parameters at 120 LSTM units, a 0% dropout rate, and a stack of two LSTM layers, then compared the effect of batch size alone. From the graph on the next page, when the batch size is 256, the training and validation losses are both at a very low level overall, and the volatility of the validation loss is smaller. Thus, batch size 256 performs best.

Batch size effect on training/validating loss

Experiment 2: Examine Dropout Effect

We then trained with dropout rates of 0.5 and 0.25 for each batch size (64, 128, and 256). With dropout added, the validation loss decreases significantly in all cases. Consistently, batch size 256 still has the lowest loss at dropout rates of both 0.5 and 0.25. However, with batch size 256 and dropout rate 0.5, the plot shows a higher overall loss and volatility. We therefore concluded that an LSTM with dropout rate 0.25 and batch size 256 may be a good fit for our data. Note that, because of dropout, the recorded training loss is sometimes larger than the validation loss: during training, a certain percentage of nodes have their weights set to 0.

Dropout rate 0.5 or 0.25 effect on training/validating loss

Experiment 3: Examine LSTM Units Effect

We also examined the number of LSTM units. Note that we initially chose 120 LSTM units, while some papers suggest \(\frac{2}{3}(N_i + N_o)\) as a basic reference for choosing the number of LSTM hidden nodes, where \(N_i\) is the number of input neurons and \(N_o\) is the number of output neurons. In our case, \(\frac{2}{3}(N_i + N_o) \approx 80\) units, so we also trained the LSTM model with 80 units, and for exploratory purposes we tried 160 as well. It turns out that 120 LSTM units is still the best choice, since both 80 and 160 units gave higher training and validation losses.

LSTM units effect on training/validating loss
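With our 120-day window read as 120 input neurons and a single output neuron (our interpretation of \(N_i\) and \(N_o\)), the heuristic works out to roughly 80 units:

```python
# Rule-of-thumb hidden-unit count: two-thirds of (input + output) neurons,
# truncated to an integer.
n_input, n_output = 120, 1
units = int(2 / 3 * (n_input + n_output))  # floor(80.67) = 80
```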

Experiment 4: Examine Stack Effect

As Karsten Eckhardt suggests in his article "Choosing the Right Hyperparameters for a Simple LSTM Using Keras", two layers are generally enough to detect more complex features; more layers can be better but are also harder to train, and as a rule of thumb one hidden layer works for simple problems while two are enough to find reasonably complex features. Our earlier experiments used an LSTM stack of 2; in this experiment we also examine stacks of 1 and 3. The plots show that a stack of 2 is the best choice for our model, providing the lowest loss across training and validation steps.

Stack effect on training/validating loss

3.4 Final Quantitative Model:

Here we present our final, optimal model based on the candidate model comparisons above. The following figures show the statistics and architecture of the final model: a 2-stacked LSTM with 120 LSTM units, a dropout rate of 0.25, a past window size of 120, and a future window size of 1. The past window size is the number of days the model uses to predict, and the future window size is the number of days the model forecasts. The final model is trained with batch size 256 and 150 epochs.

Definition of the final quantitative model

Data processing flow chart and the associated LSTM unit structure
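A sketch of the final architecture in Keras/TensorFlow. The placement of dropout between the stacked layers is our reading of the figures, and the Adam optimizer is an assumption; the hyper-parameter values (120 units, 0.25 dropout, 120-day window, 1-day horizon) come from the text above:

```python
import numpy as np
from tensorflow.keras import layers, models

def build_final_model(window=120, n_features=1, units=120, dropout=0.25):
    """2-stacked LSTM: 120 units per layer, 0.25 dropout, one-day output."""
    model = models.Sequential([
        # First LSTM layer returns the whole sequence so a second can stack on it.
        layers.LSTM(units, return_sequences=True, input_shape=(window, n_features)),
        layers.Dropout(dropout),
        # Second LSTM layer returns only its final hidden state.
        layers.LSTM(units),
        layers.Dropout(dropout),
        # Single output: the next day's log index value.
        layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```

Training would then call `model.fit(X, y, batch_size=256, epochs=150, validation_split=0.2)` to match the settings above.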

3.5 Final Text Model:

For the S&P 500, we tried using US-China news, i.e., articles whose actors are in ["US", "CHINA"], to help prediction. The model structure is shown below, together with the validation loss comparison. Sentiment news does not seem to improve the loss for the S&P 500 and causes significant bias. That might be because the S&P 500 comprises many companies, and US-China relation news alone cannot accurately capture changes in the whole market. Also, for convenience and due to a lack of time on this project, the text model was not carefully tuned, and the sentiment scores were not scaled to the same level as the log S&P 500 index. For a single stock such as Google, however, the same model does improve the validation loss, because there we use actors directly associated with Google via knowledge graphs. For consistency, we only present the S&P 500 comparison here.

LSTM text model loss on training vs. validating

LSTM text data processing flow

IV. Prediction

From Quantitative Model:

In this section, we provide our prediction results for both the training and testing datasets. The first plot shows the predictions when the training set is fed into the model, while the second plot shows the model's predictions on the testing set. Our RMSE on the testing dataset is 0.0024.
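The reported RMSE follows the standard definition, sketched here:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between actual and predicted (log) index values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```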

LSTM predictions without text

From Text Model:

After adding the text into the LSTM model, the training prediction error decreases to an RMSE of 0.0113. However, the model shows a significant positive bias when predicting the S&P 500 index on the testing dataset, with a testing RMSE of 0.0207. This shows that more work is needed in the future to adjust it.

LSTM predictions with text

V. Discussion:

In conclusion, our final models are able to predict the S&P 500 index well. In particular, the LSTM model with only numerical data shows great power in learning the future trend of the index. We also tried adding text input to this LSTM model with some adjustments to the original LSTM architecture. Based on comparisons of predictions and of training-step losses, we do not find that the extra text input provides a significant improvement over using numerical data alone for predicting the S&P 500 index.

LSTM with text vs. without text input

However, we think there is room to improve performance in future studies. For example, we could try a larger training dataset to enhance the model's learning accuracy and ability, and we could develop a more efficient way to score the news information; for instance, we could grab only the titles of financial news instead of scanning the whole news content. We could also spend more time improving the model structure.

References

[0] A LSTM-based method for stock returns prediction: A case study of China stock market.

[1] Alpha Vantage. https://github.com/RomelTorres/alpha_vantage

[2] gdeltPyR. https://github.com/linwoodc3/gdeltPyR

[3] Le, X.-H., Ho, H. V. Application of Long Short-Term Memory (LSTM) Neural Network for Flood Forecasting. Water 2019, 11(7), 1387, 2019.

[4] Chen, K., Zhou, Y., Dai, F. A LSTM-based method for stock returns prediction: A case study of China stock market. In IEEE International Conference on Big Data (Big Data), pp. 2823-2824, 2015.

[5] Nan, A., Perumal, A., Zaiane, O. R. Sentiment and Knowledge Based Algorithmic Trading with Deep Reinforcement Learning. arXiv:2001.09403v1, 2020.